R Project

The data is a made of three sub-datasets, one looks at students aged 15 and their proficiency in reading, mathematical and scientific literacy, also contains details about their socio-economic background such as parents’ education, wealth score, access to computers and internet and if they share or have their own room. The second subset is schools, which contains info about the school funding what proportion comes from government, fees or donations, the number of students enrolled and whether it is private or public, lastly is the country subset containing the name of participating countries and their associated three letter code. The data has been collected every three years starting in 2000 and the last available year is 2018. The data is available in the learningtower CRAN package.
library(learningtower)
library(tidyverse)
library(maps)
library(ggmap)
library(plotly)
#Loading datasets
# student_data <- read.csv(file = 'Datasets/Student_data.csv')
# school_data <- read.csv(file = 'Datasets/School_data.csv')
# country_data <- read.csv(file = 'Datasets/Country_code_data.csv')
student_data <- load_student("all")
data(school)
school_data <- school
data(countrycode)
country_data <- countrycode
wrldmap <- map_data("world")
mrgddata <- merge(wrldmap,country_data,by.x="region",by.y="country_name")
mrgddata <- mrgddata[order(mrgddata$group,mrgddata$order),]
country_data_year_wise <- school_data |>
  group_by(country, year) |>
  select(country, year)
country_data_year_wise <- unique.data.frame(country_data_year_wise)
merge_country_year <- merge(country_data_year_wise,country_data,by.x="country",by.y="country")
mrgddata <- merge(wrldmap,merge_country_year,by.x="region",by.y="country_name")
mrgddata <- mrgddata[order(mrgddata$group,mrgddata$order),]
p <- ggplot(mrgddata) +
  aes(x=long,y=lat,group=group,frame = year) + geom_polygon() + aes(fill=region) +
  theme_dark() +
  labs(x="",y="",title="A view of participating countries")
fig <- ggplotly(p)
fig <- fig %>% 
  animation_opts(
    1000, easing = "elastic", redraw = FALSE
  )
fig <- fig %>% 
  animation_button(
    x = 1, xanchor = "right", y = 0, yanchor = "bottom"
  )
fig <- fig %>%
  animation_slider(
    currentvalue = list(prefix = "YEAR ", font = list(color="red"))
  )

fig
-1000100200-50050
regionAlbaniaArgentinaAustraliaAustriaBelgiumBrazilBulgariaCanadaChileDenmarkFinlandFranceGermanyGreeceHungaryIcelandIndonesiaIrelandIsraelItalyJapanLatviaLiechtensteinLuxembourgMexicoNetherlandsNew ZealandNorwayPeruPolandPortugalRomaniaRussiaSouth KoreaSpainSwedenSwitzerlandThailandYEAR 20002000200320062009201220152018A view of participating countriesPlay

Fig 1.1 World map particapting countries

The questions that are addressed from this dataset are as follows:

1. How the scores across maths, science and reading have changed over the years?

2. Is there a difference in performance between students who attend public or private schools?

3. Effects of the parent’s education of student’s scores?

4. Do students who have a higher escs score perform better in maths then those with lower scores?

For the analysis I am focusing mainly on the student dataset but merging student and school for the private VS public analysis and student and country for the participating counties graph. One of the packages we will be using is including tidyverse which contains other packages such as ggplot2 for graphing and dpylr for data manipulation and tidyr to organise the data. Plotly is used to make the graphs interactive. The analysis mostly consists of filtering and grouping the variables for comparison, often deriving the mean scores of the three subjects.
The student dataset covers years 2000 to 2018 as mentioned in the introduction, it is worth nothing that most missing data comes from the earlier years 2000/2003, the reason for this is unknown but could be due to issues rolling out the program, there is still a depth of information available but the years such as 2003 will be omitted when focusing on specific variables such as escs scores.
The variables within the subset are in two data types, factors and numeric, this is appropriate for analysis and doesn’t require changing, except when answering the 3rd question relating to escs scores a new variable made from transforming escs from numeric to a factor and then cutting into 4 groups will be used to show a tiered analysis. Most of the analysis will focus on filter and grouping the variables
The variable escs measure of socioeconomic status. It is based on three variables—parents’ highest occupational status, parents’ highest educational level, and home possessions—which are standardized and then averaged to an index. Home possessions are based on a set of 25 items which PISA subdivides further into four variables: wealth possessions, cultural possessions, home educational resources, and the number of books at home.
The variables “mother_educ” and “father_educ” are graded on the International Standard Classification of Education (ISCED) rankings. The lowest group in education would be those with “less the ISCED 1”, ISCED 1 is primary education starting at around 5 years old, with ISCED 3 being the highest and indicating Upper Secondary Education, within ISCED there are 3A, 3B and 3C with 3A being the highest subgroup. Those with education of less the ISCED 1 likely never attended formal education.

How have student’s scores changed over the years?

q1 <- student_data %>% 
  group_by(year) %>% 
  summarise(avgMath = mean(math, na.rm = T),
            avgSci = mean(science, na.rm = T),
            avgRead = mean(read, na.rm = T),
            avgEscs = mean(escs, na.rm = T),
            avgWealth = mean(wealth, na.rm = T))
DifferenceInMath<- diff(q1$avgMath)
DifferenceInScience <-  diff(q1$avgSci)
DifferenceInReading <- diff(q1$avgRead)
DifferenceInWealth <- diff(q1$avgWealth)
DifferenceInEscs <- diff(q1$avgEscs)
totalDiffs <- cbind(DifferenceInMath,DifferenceInScience,DifferenceInReading,
      DifferenceInWealth,DifferenceInEscs)
plotq1 <- q1 %>% 
  ggplot(aes(x = year))+
  geom_point(aes(y = avgMath, color = "Math"))+
  geom_line(aes(y = avgMath, color = "Math"), group = 1, lwd = .6)+
  geom_point(aes(y = avgSci, color = "Science")) +
  geom_line(aes(y = avgSci, color = "Science"), group = 1, lwd = .6)+
  geom_point(aes(y = avgRead, color = "Reading"))+
  geom_line(aes(y = avgRead, color = "Reading"), group = 1, lwd = .6)+
  ggtitle("Average scores over time")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Year")+
  ylab("Score")+
  guides(fill=guide_legend(title="Subject"))+ 
  labs(color = "Subject")
  
  
ggplotly(plotq1)
2000200320062009201220152018460470480
SubjectMathScienceReadingAverage scores over timeYearScore

Fig 3.1: Scatterplot average scores over time

Years are plotted along the x-axis with scores along the y-axis, a decreasing trend can be seen across all three subjects as the years increase.

Performance of students studying in private vs public schools.

year_wise_school_mean_scores <- student_data |>
  group_by(school_id, country, year) |>
  summarize(mean_math_scores = mean(math, na.rm=T), mean_science_scores = mean(science, na.rm=T), mean_read_scores = mean(read, na.rm=T))
student_school_data = merge(year_wise_school_mean_scores,school_data,by = intersect(names(year_wise_school_mean_scores), names(school_data)))
private_schools_mean_scores <- student_school_data |>
  filter(public_private == "private") |>
  group_by(country) |>
  summarize(mean_math_scores = mean(mean_math_scores, na.rm=T), mean_science_scores = mean(mean_science_scores, na.rm=T), mean_read_scores = mean(mean_read_scores, na.rm=T)) |>
  arrange(desc(mean_math_scores), desc(mean_science_scores), desc(mean_read_scores))
public_schools_mean_scores <- student_school_data |>
  filter(public_private == "public") |>
  group_by(country) |>
  summarize(mean_math_scores = mean(mean_math_scores, na.rm=T), mean_science_scores = mean(mean_science_scores, na.rm=T), mean_read_scores = mean(mean_read_scores, na.rm=T)) |>
  arrange(desc(mean_math_scores), desc(mean_science_scores), desc(mean_read_scores))
country_grp_scores <- student_school_data |>
  group_by(country) |>
  summarize_at(vars(mean_math_scores, mean_science_scores, mean_read_scores), funs(mean(.,na.rm = T)))
df <- merge(private_schools_mean_scores,public_schools_mean_scores, by.x = "country", by.y = "country")
df <- merge(df,country_grp_scores,by.x="country",by.y="country")
df <- df |>
  rename(Private.Mean.Math.Scores = mean_math_scores.x, Private.Mean.Science.Scores=mean_science_scores.x, Private.Mean.Read.Scores=mean_read_scores.x, Public.Mean.Math.Scores=mean_math_scores.y,Public.Mean.Science.Scores=mean_science_scores.y,Public.Mean.Read.Scores=mean_read_scores.y,Total.Mean.Math.Score = mean_math_scores, Total.Mean.Science.Score = mean_science_scores, Total.Mean.Read.Score = mean_read_scores)
df <- df |>
  mutate(MathDiff = ((df$Private.Mean.Math.Scores-df$Public.Mean.Math.Scores)/df$Total.Mean.Math.Score)*100,
                 ScienceDiff = (df$Private.Mean.Science.Scores-df$Public.Mean.Science.Scores)/df$Total.Mean.Science.Score*100,
                 ReadingDiff = (df$Private.Mean.Read.Scores-df$Public.Mean.Read.Scores)/df$Total.Mean.Read.Score*100,
                 Total=df$Total.Mean.Math.Score+df$Total.Mean.Science.Score+df$Total.Mean.Read.Score,
                 AverageDiff=(MathDiff+ScienceDiff+ReadingDiff)/3) |>
  arrange(desc(MathDiff), desc(ScienceDiff), desc(ReadingDiff))
new_df <- df[c(1:12,81:93),]
plt1 <- ggplot(new_df)+
  aes(x=reorder(country,MathDiff),y=MathDiff)+
  geom_bar(stat='identity', fill = "lightblue") +
  geom_hline(yintercept = mean(df$MathDiff),size=1,col="brown") +
  labs(x="Country Names",y="Math Score Diff %",title="Difference Percentage in Math - Private vs Public") + coord_flip() +
  theme_bw()
plt2 <- ggplot(new_df)+
  aes(x=reorder(country,ScienceDiff),y=ScienceDiff)+
  geom_bar(stat='identity', fill = "palevioletred2") +
  geom_hline(yintercept = mean(df$ScienceDiff),size=1,col="brown") +
  labs(x="Country Names",y="Science Score Diff %",title="Difference Percentage in Science - Private vs Public") + coord_flip() +
  theme_bw()
plt3 <- ggplot(new_df)+
  aes(x=reorder(country,ReadingDiff),y=ReadingDiff)+
  geom_bar(stat='identity', fill = "lightgreen") +
  geom_hline(yintercept = mean(df$ReadingDiff),size=1,col="brown") +
  labs(x="Country Names",y="Reading Score Diff %",title="Difference in mean scores (%) - Private vs Public") + coord_flip() +
  theme_bw()
dataf <- school_data |>
  group_by(year, public_private) |>
  summarize(public_private_count = n())
private <- dataf |>
  filter(public_private == "private")
public <- dataf |>
  filter(public_private == "public")
attach(dataf)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE),
   widths=c(3,1), heights=c(1,2))
plot <- ggplot(dataf) +
  aes(x=year,y=public_private_count, colour = public_private) +
  geom_point() +
  geom_line(size=1) +
  labs(x="Year",y="School Count",title = "Plotting the increase in public & private schools over the years")

subplot(plt1, plt2, plt3, shareX = TRUE, shareY = FALSE)
-10010203040TUNMUSTAPTTOIDNTHAVNMHKGLIEBLRSRBCHEARGMLTMKDBGRSVNPANQATURYKSVQVEBRAKGZ-100102030MUSTUNTAPTTOIDNVNMTHAHKGSRBLIECHEBLRMKDARGBGRKSVMLTSVNQATURYBRAPANQVEKGZ02040TUNTTOMUSTAPSRBVNMIDNTHAHKGLIECHEBLRMKDARGKSVBGRMLTSVNQATBRAPANURYQVEKGZ
Difference in mean scores (%) - Private vs PublicMath Score Diff %Science Score Diff %Reading Score Diff %

Fig 3.2: Difference in mean scores of Public vs Private

plot
Fig 3.2B: Increase in schools over years

Fig 3.2B: Increase in schools over years

math_t_test <- t.test(df$Private.Mean.Math.Scores, df$Public.Mean.Math.Scores,alternative="greater")
science_t_test <- t.test(df$Private.Mean.Science.Scores, df$Public.Mean.Science.Scores,alternative="greater")
read_t_test <- t.test(df$Private.Mean.Read.Scores, df$Public.Mean.Read.Scores,alternative="greater")
cat("P-value of difference in means of Private and Public samples of math : ", math_t_test$p.value, "\nSample estimate of mean of Private school student scores : ", math_t_test$estimate[[1]], ", Sample estimate of mean of Private school student scores : ", math_t_test$estimate[[2]],"\nP-value of difference in means of Private and Public samples of science : ", science_t_test$p.value,"\nSample estimate of mean of Private school student scores : ", science_t_test$estimate[[1]], ", Sample estimate of mean of Private school student scores : ", science_t_test$estimate[[2]],"\nP-value of difference in means of Private and Public samples of read : ", read_t_test$p.value,"\nSample estimate of mean of Private school student scores : ", read_t_test$estimate[[1]], ", Sample estimate of mean of Private school student scores : ", read_t_test$estimate[[2]])
## P-value of difference in means of Private and Public samples of math :  0.0003508512 
## Sample estimate of mean of Private school student scores :  474.3474 , Sample estimate of mean of Private school student scores :  442.2904 
## P-value of difference in means of Private and Public samples of science :  0.0001497735 
## Sample estimate of mean of Private school student scores :  476.977 , Sample estimate of mean of Private school student scores :  444.4084 
## P-value of difference in means of Private and Public samples of read :  7.545733e-05 
## Sample estimate of mean of Private school student scores :  470.7946 , Sample estimate of mean of Private school student scores :  436.3268

The above plot shows the mean difference in scores of students from Private and Public schools. The x-axis is the difference between private and public schools for each subject (maths, science and reading), while countries are plotted along the y-axis. The mean of difference between private and public student scores for math, science & read are 7.60, 7.66 and 8.36 respectively which is positive and says that the private schools are performing better than public schools for all three subjects.

Effects of the parent’s education of student’s scores?

x<- student_data %>% group_by(year,country,mother_educ,father_educ,gender)%>% 
  filter(year!=2000)%>% summarise(subject1=mean(math,na.rm=TRUE),
  subject2=mean(science,na.rm=TRUE),subject3=mean(read,na.rm=TRUE))
y<-x %>% pivot_longer(cols =c('mother_educ','father_educ'),
                      names_to = 'mother_father_educ',values_to = 'Levels')
z<- y%>% pivot_longer(cols = c('subject1','subject2','subject3'),
                      names_to = 'subject',values_to = 'score')
mplot <- ggplot(data=z) + 
  geom_histogram(mapping=aes(x=score,color=mother_father_educ))+ 
  facet_wrap(vars(Levels))+
  ggtitle("Parent's education by combined average score")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Score")+
  ylab("Count")+
  labs(color = "Parent's Education")
splot <- ggplot(data=z) + 
  geom_histogram(mapping=aes(x=score,color=mother_father_educ,fill=gender))+ 
  facet_wrap(vars(Levels))+
  xlab("Score")+
  ylab("Count")+
  labs(color = "Parent's Education")

mplot
Fig 3.4: Parent's education grouped vs combined average score

Fig 3.4: Parent’s education grouped vs combined average score

splot
Fig 3.4: Parent's education grouped vs combined average score by gender

Fig 3.4: Parent’s education grouped vs combined average score by gender

The graph depicts the relationship between average values of all three subjects based on their mother and father education qualification across all years. From the plot we can observe that if the mother and father have a good education qualification then the result of the student has been significantly increased in all the corresponding years. The year 2000 is omitted due to NA in the parent’s education variables.

Do students who have a higher escs score perform better in maths then those with lower scores?

q3 <- student_data %>% 
  select(math, science, read, wealth, escs, year)
q3$escsGroup <- as.factor(cut(q3$escs, 4, labels = c("T 1", 
                                                      "T 2", 
                                                      "T 3", 
                                                      "T 4")))
q3C <- q3 %>%
  filter(year != 2003) %>% 
  filter(!is.na(escs)) %>% 
  group_by(year,escsGroup) %>% 
  summarise(avgMath = mean(math, na.rm = T),
            avgEscsPerGroup = mean(wealth, na.rm = T))
q3plot <- q3C %>% 
  ggplot(aes(escsGroup, avgMath, fill = avgMath))+
  geom_bar(stat = "identity")+
  facet_wrap(~year) +
  coord_cartesian(ylim=c(300,550))+
  ggtitle("Average maths scores by escs tier")+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab("Tier")+
  ylab("Math Score")
 
ggplotly(q3plot)
300350400450500550T 1T 2T 3T 4300350400450500550T 1T 2T 3T 4T 1T 2T 3T 4
350400450500avgMathAverage maths scores by escs tierTierMath Score200020062009201220152018

Fig 3.3: Mean maths score by escs tier

Tiers 1-4 are plotted along the x-axis while the mean maths score is plotted along the y-axis. The Tiers arrange from T1 being the lowest escs to T4 being the highest. The graph shows that higher tiers perform better in maths compared to the lower tiers.

Question 1

Investigated how the scores have changed over the years, the average score of each subject has decreased over the years. Science had remained as the highest scoring area until 2018 where math became the highest scoring subject. Comparing highest to lowest scores, science has fallen 25 points, maths 21 and reading also 25. Since 2003 there has only been one growth period that was 2012.

Question 2

Compared the performances of private schools to public schools, across all three subjects’ private schools performed better than their public counterparts. Looking at the statistical data he p-values and estimates from t-test state that the difference in means of Private and Public samples of math, science and read scores is greater than zero and that we can say that the private school students perform better than public school students.
The reasoning for private schools outperforming public is likely due to multiple factors, since private schools often do not rely on government funding, they are not likely to suffer cutbacks, private schools are often smaller resulting in class sizes with better ratios of teacher to students and many private schools require students to pass an admission test to enter resulting in students having a good level of academic knowledge.

Question 3.

Looks to examine the effects the parents’ education has a on the students combined scores. The parents’ education is critical factor as parents with higher level of education general have better paying jobs allowing their children to access to other influential factors, also parents in countries with good education programs it is likely their children will also attend these programs.

Question 4.

Examined if students with a higher escs score perform better in maths. Since escs is an index combing several factors that are within the student subset the analysis of this variable it is a valuable asset to see its relationship to students’ scores. Those students with a higher escs score performed better in maths, likely for the same reason those who attend private schools perform better, access to higher quality education, parents that have good education and high income and likely the student can be in education fulltime rather than having a job to help support their family. The top tiers (T3-T4) perform much better than the lower tier counterparts.